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Abstract: Fast Multipole Methods (FMM) are a fundamental operation for the simulation 
of many physical problems. The high performance design of such methods usually requires to 
carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we 
propose a new approach that achieves high performance across architectures. Our method consists 
of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, 
StarPU, in order to process the tasks on the different processing units. We carefully design the task 
flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing 
Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces 
of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 
million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced 
with 3 Nvidia M2090 Fermi CPUs. 

Key-words: Fast multipole methods, graphics processing unit, heterogeneous architectures, 
runtime system, pipeline 
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Pipeline de la methode multipole rapide sur un moteur 

d'execution 



Resume : Les methodes multipoles rapides (FMM) sont une operation fondamentale pour 
la simulation de nombreux problemes physiques. Leur mise en ceuvre haute performance re- 
quiert habituellement d'optimiser attentivement l'algorithme a la fois pour la physique visee et 
la materiel utilise. Dans ce papier, nous proposons une nouvelle approche qui atteint une perfor- 
mance elevee et portable. Note methode consiste a exprimer l'algorithme FMM comme un Hot de 
taches et d'employer un moteur d'execution, StarPU, afin de traiter les taches sur les differentes 
unites d'execution. Nous concevons precisement le hot de taches, les operateurs mathematiques, 
leur implementations sur unite centrale de traitement (CPU) et processeur graphique (GPU) 
ainsi que les schemas d'ordonnancement. Nous calculons les potentiels et forces pour des prob- 
lemes de 200 millions de particules en 48.7 secondes sur une machine homogene SGI Altix UV 
100 comportant 160 cceurs et de 38 millions de particules en 13.34 secondes sur une machine 
heterogene composee d'un processeur Intel Nchalem accelere avec 3 GPUs Nvidia Fermi M2090. 

Mots-cles : Methodes multipoles rapides, processeur graphique, architectures heterogenes, 
moteur d'execution, pipeline 
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1 Introduction 



Pair- wise particle interactions play an important role in many physical problems. Examples 
are astrophysical simulations, molecular dynamics, the boundary element method, radiosity in 
computer-graphics and dislocation dynamics. In the last decades numerous algorithms have 
been developed in order to reduce the quadratic complexity of a direct computation. The fast 
multipole method (FMM), first presented in [JJ, are probably the most prominent one. The 
fact that they have linear complexity makes them candidates of first choice for processing large- 
scale simulations of physical problems Thus, the design of efficient FMM implementations 
is crucial for the high performance computing (HPC) community. They have been ported to 
multicore processors |3J 0] [5J |B] , graphical processing units (GPUs) clusters [71 151 [51 ITUl ITTl 112"] . 
Because these codes were tightly coupled with the targeted architectures, they could achieve 
excellent performance, sometimes beyond 10 6 floating point operations per second ( gf1o p/s). On 
the other hand, porting a code from one architecture to another was a commitment requiring very 
important (valuable) human resources. In this paper, we consider an alternative approach for 
achieving high performance across architectures. For that purpose, we turn the FMM algorithm 
into a task flow and employ a runtime system in order to dispatch the tasks on the actual 
computational units. 

Processing HPC algorithms over runtime systems was successfully studied in the case of 
dense linear algebra algorithms during the past five years [131 IT41 IT51 [TBI [T71 [T51 HU1 EU] and 
is now a common utility for related state-of-the-art libraries such as Plasma [UJ, Magma [2"2~] 
and Flame [23J. Dense linear algebra were excellent candidates for pioneering this path. First, 
the related task graph is very wide and therefore allows for many computational units to run 
concurrently. Second, the relative regularity of the tasks makes them particularly easy to schedule 
and achieve optimum performance |24| . On the contrary, pipelining the FMM over a runtime 
system is much more challenging. Indeed, the pattern of an FMM task flow is much more 
irregular and, hence, involving to handle. Moreover, the nature of the different computational 
steps and the subsequent granularity of tasks add complexity for ensuring an efficient scheduling. 
To tackle these challenges, we carefully analyze the FMM task flow such that it can be efficiently 
pipelined, we define fast mathematical operators and implement them on central processing units 
(CPU) and graphics processing units (GPUs), and we construct empirical performance models 
together with appropriate scheduling algorithms. 

We propose an innovative methodology for designing HPC FMM codes. To our knowledge, 
only two other (very recent and not yet published) attempts have been made for processing the 
FMM over runtime systems. Ltaief and Yokota [55] have studied the feasibility of the approach 
on a 16 cores homogeneous machine. Bordage |26_ obtained preliminary results for the Hclmholtz 
kernel on heterogeneous architectures. In the present work, we not only show that the FMM 
can be highly pipelined, but also obtain performance numbers across architectures which are 
comparable to well established and heavily tuned methods for specific architectures [HI HH HU IS] . 

The paper is organized as follows. In Section [2] we briefly introduce the FMM algorithm in 
use and present an improved M2L kernel. In Section [3] we introduce the StarPU [57] runtime 
system and explain how to build a naive FMM task flow to be processed by the runtime system. 
In Section [4] we carefully design an improved task flow and present scheduling strategies for 
homogeneous architectures and assess their impact on multicore platforms. Finally, we tackle 
heterogeneous architectures in Section [5] before concluding in Section [7] 
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2 Fast Multipole Method 

Pair-wise particle interactions can be modeled mathematically as 

N 

I, ^ /'(,,-,..,/, nr., for i = l,...,M, (1) 

i=i 

Here, pairs of particles, in the following denoted as sources and targets, are represented by 
their spatial position x,y £ M 3 , respectively. The interaction is governed by the kernel function 
P(x,y). The above summation can also be understood as matrix- vector product f = Pw, where 
P is a dense matrix of size M x N. Hence, if we assume M ~ N, the cost grows like 0(N 2 ). 
The FMM reduces the cost of computing such summations to O(N). Its idea is to approximate 
the kernel function by a low-rank representation whose error decays exponentially if sources and 
targets are well separated (far-field). If sources and targets are not well separated (near-field) 
the kernel function will be used in its original form. These facts are exploited by hierarchically 
partitioning the computational domain into an octree (in R 3 ) an subsequently identifying the 
near and the far-field. Finally, we end up with a data-sparse approximation of the dense matrix 
P. 



2.1 The black-box FMM algorithm 

Many FMM algorithms are kernel specific, meaning, they require a distinct analytic treatment 
for each kernel function. Our approach is adopted from [55] and can deal with broad range of 
kernel functions. Examples for the Laplace and the Stokes kernel and for various multiquadric 
basis functions have been presented. Another kernel independent FMM with a lower computa- 
tional cost is presented in [29 . However, it is less general, since it works only for functions which 
are fundamental solutions of second-order constant coefficient non-oscillatory elliptic partial dif- 
ferential equations. Our approach has also been extended to the oscillatory Hclmholtz kernel 
in [3U]. 

In the paper at hand we present results for the Laplace kernel function 

P(z,y) = — ^ and F{x,y) = (2) 
\x-y\ \x-y\ A 

Note, the second one can be written as F(x,y) = W x P(x,y). And we use a Chebyshev interpo- 
lation scheme to interpolate these kernel functions as 



P(x,y) ~ ^ S m (x) ^P{x m ,y n )S n (y) and 

m— 1 n— 1 

I i 

y)~/] W x S m (x) ^2 p (x™, y n )S n (y)- 



(3) 



by shifting the gradient V x from F(x, y) to the interpolation polynomial S m (x). In the following 
the interpolation polynomial is referred to as P2M (particle-to-moment), M2M (moment-to- 
moment), L2L (local-to-local) and L2P (local-to-particle) operator and the point- wise evaluated 
kernel function as the M2L (moment-to-local) operator. Note, the P2M, M2M, M2L and L2L 
operators are the same for Equation only the L2P operator differs. 

In the remainder of this section, we present technical details on how we improved the black- 
box FMM. Note, they are not required for the understanding of the main scope of the paper and 
readers not familiar with the FMM may proceed to Section [3] 
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2.1.1 Efficient implementation of the P2M, M2M, L2L and L2P operators 

In M 3 we use the tensor-product ansatz for the translation operators 

S m (x) = S^iC^i) <E> S ma (x 2 ) ® S m . 3 (x 3 ). (4) 

where each 

1 2 '~ l 

Sm^iiXi) ~^ ~\~ ~j, ^ ^ T n (x m ,j) T n (xi ) , (5) 
n— 1 

is a polynomial of order £—1 and T„(a;) with 16 [— 1, 1] are Chebyshev polynomials of first kind 
and x are interpolation points chosen to be Chebyshev roots. 

The application of the M2P and P2L operator requires 0(M£ A ) floating point operations. 
However, by exploiting the outer-product-like representation from Equation §5§, the cost can be 
reduced to 0(M£ 3 + £ 4 ). Moreover, due to the tensor-product ansatz in Equation Q the matrix- 
vector-product can be reformulated as a sequence of smaller matrix-matrix-products coupled with 
permutations (use of Blas3). In this way the cost for applying the M2M and L2L operator can 
be reduced from 0(£ 6 ) to 0(£ 4 ) only. 

2.1.2 Improved M2L operator 

Normally, in M 3 the far-field is limited to the at most 27 near-field interactions of the parent- 
cell, only. This leads to at most 189 far-field interactions for one cell. Most kernel functions 
are homogeneous (if we scale the distance between source y and target i by a factor of a the 
resulting potential is scaled by a", where n is a constant and depends on the kernel function). 
Hence, the M2L kernels of all 316 possible far-field interactions for all cells need to be computed 
only once. 

The authors of [28 proposed to compress all possible M2L operators P t with t = 1, . . . , 316, 
each of size £ 3 x £ 3 , via two big singular value decompositions (SVD). Subsequent algebraic 
transformations lead to an efficient representation as P t ~ UC t V T , where Q is a matrix of size 
r x r (low-rank obtained via the SVD based on a prescribed accuracy £svd)- The matrices U 
and V, both of size £ 3 x r, are the same for all t, hence, their application can be shifted to the 
M2M and L2L operator, respectively. 

We propose another approach. First, by exploiting symmetries, we can represent the 316 
M2L operators by permutations of 16, only. Second, we apply an individual SVD to each of the 
16 operators. In other words, instead of computing 316 matrix-vector products we compute 16 
matrix-matrix-products coupled with permutations. 



Table 1 : Comparison of both M2L optimizations 



Acc 


?-316 


weighted r 16 


cost i 6 / COSt 3 i 6 


3 


19 


4.6 


0.69 


5 


67 


11.2 


0.62 


7 


150 


22.2 


0.67 



We compare both approaches in Tab. [T] The first column shows three different accuracies of 
the method given by Acc, i.e., (£, £svd) = {Acc, 10~ Acc ). Studies in [3D] have shown that it can 
be useful to correlate £ and esvD- In other words, by solely reducing esvD or increasing £, the 
method does not give more accurate results. The accuracies Acc — 3, 5, 7 lead to an relative error 
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of el 2 ~ 10~ 5 , 1CU 7 , 1CP 9 , respectively. The second column in Tab. [T] shows the obtained low- 
rank r 316 by the first approach and the third column the weighted low-rank r 16 of our approach. 
The overall cost is given by cost3i6 = O (316 • r| 16 ) and costi6 = O (316 • 2£ 3 rio), respectively. 
Column four shows that our approach is favorable in terms of 

• reduced precomputation cost and memory requirement: 16 small SVD, each of size £ 3 x £ 3 
instead of 2 big SVDs, both of size 316 £ 3 x £ 3 

• better compression (weighted r 16 <C r 316 ) reduces cost of applying the M2L operators 

• better cache reuse: 16 matrix-matrix products instead of 316 matrix-vector products 

3 FMM over a runtime system 

A runtime system is a software component that aims at supporting the execution of an algorithm 
written in a relatively high-level language. Different runtime systems were designed to support 
accelerator-based platforms. StarSs |31| is an annotation-based language that can execute pro- 
grams either on CPUs, GPUs or Cell processors, respectively, using SMPSs, GPUSs or CellSs. 
Initially designed for distributed memory machines, DAGuE |32j has been extended to handle 
GPUs too. The Harmony runtime system proposes features similar to StarPU [33] • Sequoia |34| 
statically maps hierarchical applications on top of clusters of hybrid machines. CharmH — h can 
also support GPUs [35] . 

StarPU was originally designed for handling heterogeneous platforms such as multicore chips 
accelerated with GPUs. It is a natural candidate if one aims at achieving performance across 
architectures. 

3.1 StarPU runtime system 

StarPU was essentially designed to execute codes on heterogeneous platforms. The algorithm 
(FMM in our case) to be processed by StarPU is written in a high-level language, oblivious to 
the underneath platform. It is expressed as a so-called task flow consisting of a set of tasks 
with dependencies among them. Task flows can conveniently be represented as directed acyclic 
graphs (DAGs) where vertices represent individual tasks and edges dependencies among them. A 
multi-device implementation of the tasks, so-called codelet, is then provided by the programmer. 
It gathers the kernel implementations available to the different devices (CPU core and GPUs in 
our case). The data on which the tasks operate may need to be moved between the different 
computational units. StarPU ensures the coherency of these data. For that, data are registered 
to the runtime, which is accessing them not through their memory address anymore but through 
a StarPU abstraction, the handle, returned by registration. StarPU transparently guarantees 
that a task that needs to access a piece of data will be given a pointer to a valid data replicate. 
It will take care of the data movements and therefore relieve programmers from the burden of 
explicit data transfers. StarPU also detects the CPU core the closest to a GPU and selects it to 
handle that GPU. 

In the next section, we show how the FMM, usually viewed as a tree algorithm, can be turned 
into a task flow suitable for execution over a runtime system. 

3.2 FMM task flow 

We start by subdividing the computational domain hierarchically into subdomains, hereafter, 
referred to as cells. The resulting data structure is a non-directed tree as presented in Figure [Taj 
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(a) Tree view (b) Cells and operations 



Figure 1: Simplified FMM tree and subsequent operations. 



We let vertices represent cells (which contain data such as multipole or local expansions) and 



edges represent the operations (P2P, P2M, L2L, M2M, L2P) applied on them (see Figure lb). 
The tree is traversed twice. A bottom up traversal performs the P2M and M2M operations 
and a top down traversal performs the M2L followed by the L2L and L2P operations. P2P are 
usually associated to leaf cells and can be executed at any time except when the corresponding 
L2P is being executed. M2L can be represented with additional edges between siblings (see 
again Figure lb I. However, the resulting graph is neither a tree nor a DAG anymore since. The 



most straightforward way to derive a DAG from the tree would consist of mirroring the tree 
as illustrated in Figure [2a| The resulting DAG, where tasks are carried by edges, can finally 



be turned into an appropriate DAG shown in Figure 2b where tasks are carried by vertices (as 
required by the runtime system, see Section 3.1 1. 



■V.:.;. 

|l2P L2^ |l2P L2^ 




|p2M P2^ Jp2M P2rjl 
* • • • * * 

(a) Mirrored tree view 




(b) FMM task flow 



Figure 2: Simplified FMM DAG. 



3.3 Pipelining the FMM DAG 

In a naive implementation of the FMM DAG each task works on one cell. For example a P2P 
tasks computes the at most 26 near-field interactions of one cell at the leaf level, a M2M task 
computes the equivalent source values based on the source values of its 8 child cells or an M2L 
tasks computes the at most 189 far-field interactions. This approach leads to 
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• a large number of tasks (there are at most 8^ cells at level v, hence, the same number of 
tasks for all kernels) 

• a large number of dependencies (a cell has at most 189 far-field interactions or a leaf cell 
has at most 26 near-field interactions and all of them must be made available to each task) 

• a very small granularity (depends on Ace) 



4 Homogeneous case 

In the previous section, we have shown how the FMM can be turned into a task flow that can be 



executed by a runtime system. Here, we consider multicore platforms (Section 4.1 ) and design a 
variant of the task flow for which we tune the granularity of the tasks in order to efficiently exploit 
homogeneous multicore architectures. We show that very high parallel efficiency is achieved, 
which we explain thanks to a theoretical breakdown of the computational costs. 



4.1 Experimental setup 

We consider two homogeneous platforms for assessing our algorithms. The first platform is 
composed of four deca-core Intel Xeon E7-4870 processors running at 2.40 GHz (40 cores total). 
This machine is cache-coherent with Non Uniform Memory Access (ccNUMA) , which means that 
each core can access the memory available on all sockets, with a higher latency if the accessed 
data is on another socket. Each socket has 256 MB of random access memory (RAM) and a 
30 MB L3 cache. Each CPU core has its own LI and L2 caches of size 32 KB and 256 KB, 
respectively. 

The second machine is an SGI Altix UV 100. It is also a ccNUMA machine, composed of 
twenty octa-core Intel Xeon E7-8837 processors running at 2.67 GHz (160 cores total). Each 
socket has 32 GB of RAM and 24 MB of L3 cache. Each CPU core has its own LI and L2 caches 
of size 32 KB and 256 KB, respectively. 

In the rest of the paper, we will refer to these platforms as the jour deca-core Intel Xeon 
E7-4-870 and twenty octa-core Intel Xeon E7-8837 machines, respectively. 



4.2 Pipelining strategies 

The task flow proposed in Section [3] works on extremely fine grained data which induces a large 
amount of tasks. The overhead for the runtime system to handle all these tasks may become 
non negligible and produce a large penalty on the total execution time. Preliminary experiments 
(not reported here) showed that this approach does not scale. Assume, is the time needed to 
schedule a task, t e the time to execute a task and n p the number of available threads. If t e < n p ti 
then not sufficient work for all threads will be available. We overcome this problem by increasing 
the granularity of the tasks by letting them operate on groups of cells at the same level as shown 
in Figure [3] On one hand, an increased task granularity leads to a higher performance of the 




Figure 3: A tree split in groups of three elements 
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FMM kernels. On the other hand, fewer tasks are available and that limits the concurrency. 
Thus, we parameterize the granularity which allows us to find a trade-off between performance 
and concurrency. The parameter is the size n g of the group of cells we let a task work on. 
With this strategy we obtain relatively regular tasks even for very irregular problems. Moreover, 
it reduces the overall number of dependencies (see Fig. 14 for some studies). We use Morton 
ordering [36] to group cells. That is why they tend to be grouped also locally and, hence, they 
very likely share common near and far-field interactions. 



4.3 Breakdown of the computational work 

In Figure [4] we present the breakdown of the computational work for the different FMM kernels 
based on an example with N = 20 • 10 6 uniformly distributed particles, an octree of height h = 7 
and an accuracy Acc = 7. It shows the level-wise grouped far-field operators (P2M, M2M, M2L, 
L2L and L2P) and the respective number of required floating point operations. Edges indicate 
dependencies: The algorithm starts with the P2M at level 6 (the leaf level), once it is finished it 
gives the go to the M2L at the leaf level and the M2M at level 5, and so on. Evidently, the work 
of the M2L kernel is predominant. Moreover, most of the work is done by the kernels at the leaf 
level. Indeed, it decreases exponentially (we have a uniform octree; hence, by a factor of 8 per 
level) . 



Level 2 



Level 3 



Level 4 



Level 5 



Level 6 



M2M 0.0002% H M2L 0.0042% 



P2M 0.89% 



%) /L2L 



0002% 



»(lvl2L 



M2M 0.13% H M2L 6.88% 



I i 



M2L 58.75% 



M2M 0.002% > M2L 0.070% H L2L 0.002% 



M2M 0.016% M M2L 0.750% H L2L 0.016% 



L2L 0.13% 



M L2P 3.60% 



Figure 4: The breakdown of the computational work of the FMM kernels (N = 20- 10 6 uniformly 
in the unit-cube distributed particles, h — 7, Acc — 7) shows dependencies of the kernels at 
different levels and the percentage of the overall work which sums up to 2.71 • 10 12 floating point 
operations (P2P 28.75%) 



4.4 Parallel efficiency 

We now study the parallel efficiency of our approach, defined as 

- A_ 

nt n 

where t n denotes the measured execution time with n computational devices. In the Figures [5] 
and[6]we present studies obtained on the four deca-core Intel Xeon and twenty octa-core 

Intel Xeon E7-8837 machines, respectively. We present studies for uniformly and non-uniformly 
distributed particles and for different accuracies (defined as (/,£ S vd) = (Acc, 10- Acc )). All 
studies feature an extraordinary good scaling: between 80% and 98% efficiency at 40 CPUs in 
Figure[5]and between 60% and 86% efficiency at 160 CPUs in Figure[6] In order to understand the 
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10 20 30 40 

Number of CPUs 



Figure 5: Parallel efficiency on up to 40 CPUs on the four deca-core Intel Xeon E7-4870 machine 
for 20 • 10 6 particles (uniform distribution with h = 7 and non-uniform (NU) with h = 8); the 
granularity of tasks is parameterized by n g , the number of cells it works on. 



scaling behaviour more in detail we need to analyze the influence of the granularity of the tasks 
to be scheduled by the runtime system. In Section |4.2| we have parameterized the granularity of 
tasks by means of the number of cells n g they work on (see Fig. 14 for some studies). Obviously, 



efficiency is related to concurrency, which can be increased by adding another level to the tree 
or by reducing the granularity of the tasks. In the uniform case in Figure [5] the high accuracy 
example is less efficient than the other two. The reason is that it does not provide enough 
concurrency due to a larger granularity (n g = 500 compared to 250). In the non-uniform case 
the high accuracy example is the most efficient one. We have added one more level to the tree, 
hence, there is enough concurrency even for tasks of larger granularity. In Figure [6] we kept the 



0.8 



— 1 v . — — . 



• "^OrV '. '.V- • 



JV = 200 ■ 10°, Acc = 7 
■ JV = 200 ■ 10 6 , Acc = 5 



JV = 200 ■ 10 , Acc = 3 

' JV = 20 ■ 10 6 , Acc = 7 

■ ■ ■ ■ JV = 20 ■ 10 6 , Acc = 5 
JV = 20 ■ 10 6 , Acc = 3 



□ 



uu ' 100 150 

Number of CPUs 



Figure 6: Parallel efficiency on up to 160 CPUs on the twenty octa-core Intel Xeon E7-8837 
machine for uniformly distributed particles (for N = 20 • 10 6 we use n g = 500 and h = 7, for 

; the granularity of tasks is parameterized by n„, the 



N = 200 • 10 b we use n 



1000 and h 



number of cells it works on. 



granularity constant. The concurrency does not change but the granularity changes depending 
on the chosen accuracy (higher accuracy leads to larger granularity). All studies behave as 
expected, those with larger granularity are more efficient. 

Figure [8] shows the execution trace obtained by the runtime system StarPU. Each horizontal 
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lane shows the occupancy of a particular processing unit (here 40 CPUs) as a function of the 
time. As illustrated in Figure U\ red (dark) sections denote idle time, white sections denote the 




idle P2P M2L other kernels copying data 

Figure 7: Color legend for traces 



P2P kernel, light gray sections denote the M2L kernel and medium gray sections denote all other 
kernels (P2M, M2M, L2L, L2P; as can be seen in Figure [1] their work share is vanishingly small 
compared to P2P and M2L). The purple color denotes that a tasks is available but not the data. 
The trace shows a highly pipelined execution. Depending on the execution, a barrier can be 
observed before the L2P. This is due to data dependencies in the L2L meanly at leaves level. 




Figure 8: Trace for the study in Figure 5 with N — 20 • 10 6 and Acc = 7 (execution time t = 23 s). 



5 Heterogeneous case 

If CPUs are now available on the computational node, the execution of a task can be deported 



on them by the runtime system (see Section 3.1). As discussed above, most FMM codes for 



heterogeneous machines decide how to distribute the work load before starting the actual parallel 
execution. With such a static approach, it is very hard to consistently achieve efficient load 
balancing. StarPU allows for a dynamic distribution of the work load based on given scheduling 
strategies (see [37 ). We use the Eager scheduler for all computations. Other schedulers are 
available on StarPU, but they are not yet considered as stable. Some tests have been performed 
with the Heft scheduler which is able to estimate and use the duration of tasks to schedule 
them optimally. Such estimation is based on preliminary runs where the duration of tasks is 
stored in a database. In Fig. [9] calibration results for the Heft scheduler of the P2P kernel are 
presented. During two preliminary runs the duration of all P2P tasks as a function of the number 
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of interactions are stored in a database and based thereon the scheduler is able to estimate the 
duration of tasks in future runs. 
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Figure 9: Calibration results of the Heft scheduler after two simulations for the CPU and the 
three GPU. 



In Section |5.2| we consider executions where near-field (P2P) and far-field (other kernels) 
interactions are well balanced. Relying on a high performance blocked P2P kernel, we show that 
processing P2P (and only P2P) on GPUs allows for achieving well balanced executions. We then 
consider the case where the near-field is dominant in Section 15.31 and we show that we continue 
to achieve a good load balance by scheduling P2P tasks on demand on CPU or GPU. Finally, we 
study the case where the far-field is dominant in Section |5.4| By relying on a M2L GPU kernel 
(not yet optimized), we manage to maintain a high pipeline and well balanced execution. 



5.1 Experimental setup 

We consider a dual-socket hexa-core host machine based on Intel X5650 Nehalem processors 
operating at 2.67 GHz. Each socket has 12 MB of L3 cache and each core has 256 KB of L2 cache. 
The size of the main memory is 48 GB. Moreover, we have two different GPU sets associated to it. 
They consist of three Nvidia M2070, respectively, M2090 Fermi accelerators and are connected 
to the host with a 16x PCI bus. Each GPU has 6 GB of GDDR-5 of memory and 14 processors 
(448, respectively 512 cores), operating at 1.15 GHz. In the rest of the paper, we will refer to 
this platforms as the Nehalem- Fermi (M2070), respectively, Nehalem-Fermi (M2090) machine. 
If we do not need to explicitely refer to one of these clusters we simply call them heterogeneous 
machines since they are composed of CPUs and GPUs. Because StarPU dedicates a CPU core 



in order to handle a GPU (see Section 3.1 1, both machine will be viewed as a nine CPU cores 
node enhanced with three GPUs. 



5.2 Balancing near and far-field 

We consider executions where near-field (P2P) and far-field (other kernels) computational costs 
are well balanced. Relying on a high performance blocked P2P kernel, we design a first scheme 
that processes all P2P (and only P2P) on GPUs and performs highly pipelined executions. 

We have derived from [TU] a GPU kernel that operates on groups of n g cells. Figure 9] 
compares the execution times of our CPU and GPU P2P kernels. Because we group multiple 
cells to be processed into a single kernel call, not all particles interact directly with each other. 
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The time spent for executing this kernel is thus neither proportional to the number of cells nor 
particles. Figure [9] indeed shows that the execution time of the kernel is proportional to the 
number of effective interactions (x-axis). 

Our main goal is to obtain perfectly piplined executions. Hence, we choose the reference 
example settings such that the computational load is well balanced between the three available 
CPUs and nine available CPUs. Initially we process the near-field (P2P kernels) on the CPUs 
and the far-field (all other kernels) on the CPUs. In order to end up with a roughly matching 
execution time we choose N — 38 • 10 6 , Acc — 5 and h = 7. Moreover, we chose a large block 
size (n g = 1500) in order to achieve a high flop-rate for the P2P GPU kernel (102 gf1o p/s). 

Figure [TO] shows the execution trace of the task flow proposed in Section [4] on Nehalem- Fermi 
(M2070), the heterogeneous machine presented in Section 5.1 We recall that for the moment 



we force all P2P tasks to be executed on GPUs. As expected, they are perfectly pipelined (last 
three lanes in Figure 10 ) since they are all independent one from another. However, CPU cores 



(first nine lanes) are not fully occupied (purple color means that a task is available but its data 
not). The reason is that P2P tasks, executed on GPUs, are followed by P2Preduce tasks which 
are executed on CPUs. What is the P2Preduce tasks for? It basically exploits the fact that if 
source and target particles are the same the resulting matrix becomes symmetric. Hence, we 
do not only write the resulting potential and forces on the target particles but also on source 
particles. The P2Preduce tasks perform this data reduction and cannot be executed at the same 
time as the corresponding L2P tasks. 




Figure 10: Execution trace on the Nehalem- Fermi (M2070) heterogeneous machine (N = 38- 10 6 



Acc — 5, h = 7, n„ 



1500) with balanced near and far-field. Each P2P task is followed by the 



P2Preduce tasks. The execution time is t 
occupancy and the three last ones GPUs. 



18.50 s. The nine first lanes represent CPU cores 



Finally, by forcing P2Prcduce tasks to be processed after L2P tasks and prefetching all data 
movements between CPUs and GPUs, we obtain a perfect pipeline (see Figure 111. It reduces 
the walltime from 18.50 s to 14.78 s. 



5.3 Dominant near-field 

Next, we consider an execution with a lower accuracy {Acc — 3). M2L computational cost is 
reduced (M2L is the most sensitive kernel to the accuracy) and therefore P2P strongly dominates. 
If P2P is fully processed on GPU, CPUs remain idle most of the time (trace not reported here). 
We therefore allow for scheduling dynamically P2P on CPU or GPU. In order ensure prefetching, 
the scheduling decision has to be taken ahead of time (so that data have time to be moved between 
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(a) Nehalem- Fermi (M2070) (t = 14.78 s) 



(b) Nehalem- Fermi (M2090) (t = 13.34 s) 



Figure 11: Execution trace on both heterogeneous machines (N = 38 • 10 6 , Acc = 5, h = 7, 
n g = 1500) with balanced near and far-field. Data from GPU are prefetched and P2Preduce 
tasks are performed last. The nine first lanes represent CPU cores occupancy and the three last 
ones CPUs. 



the different computational units). However, because of the heterogeneity of the computational 
power of CPUs and GPUs, this may still lead to unbalanced executions (Figure [l2j By limiting 
the anticipation of such decisions, we managed to balance correctly the load (trace not reported 
here). 



Figure 12: Execution trace on the Nehalem- Fermi (M2070) heterogeneous machine with a dom- 
inant near-field (N — 38 • 10 6 , Acc — 3, h — 7, n g = 1500). The execution time is t ~ 19.12s. 
The nine first lanes represent CPU cores occupancy and the three last ones GPUs. 



5.4 Dominant far-field 

We finally, by increasing the accuracy (Acc = 7), we consider the case where the far-field dom- 
inates. Since M2L tasks become dominant, the computation is unbalanced (Figure 13a I. Thus, 
we have designed a (non optimized) GPU M2L kernel in order to show that we can efficiently 
balance the load by scheduling M2L tasks on demand (P2P tasks are processed on GPUs). Fig- 
ure [l3b] shows the resulting trace. The pipeline is again very efficient and the load well balanced 
(but the performance of the M2L kernel would remain to be optimized) . 
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(a) No M2L tasks on GPU (t = 49.02 s) (b) Some M2L tasks on GPU (t = 42.09 s) 



Figure 13: Execution trace on Nehalem-Fermi (M2070) with a dominant far-field (N = 38 • 10 6 , 
Acc = 7, h = 7, n g = 1500). The nine first lanes represent CPU core occupancy and the three 
last ones the GPU core occupancy. 



6 Performance benchmarks 



Several parameters influence the achieved flop-rates. One is the group-size n g , but Fig. 14 shows 
that its influence is less critical. In other words, a broad range of n g leads to the same good 
flop-rates. Interesting is the benchmark on the 20 octa-core Intel Xeon machine for N — 38- 10 6 : 
it is best performing for 1000 < n g < 3000. This is a rather small interval compared to the other 
three benchmarks. The reason is that larger n g does not provide enough parallelism anymore. 



In Fig. 15 we show flop-rates studies for uniform (in the unit-cube) and in Fig. 16 for non- 
uniform (on the unit-sphere) particle distributions. The number of particles is in either case 
N = 38 ■ 10 6 . We vary the octree height h = 6, 7, 8. By doing so we can redeploy work between 
near- and far-field. A smaller h means more near- and less far-field and a larger h the other way 
around. Moreover, we vary the accuracies Acc = 2, ... ,7. By doing so, the cost for evaluating 
the near-field does not change but higher Acc increases the cost for evaluating the far-field. All, 
these effects can obviously be recognized in the figures [T"5| and [16] 

In the figures [T5a] |16a| and |16b| the computations are clearly dominated by the near-field, i.e., 
the P2P kernels. Except in the GPU case they are all performed on the GPU. In either case 
we observe the linear scaling for additional GPUs. In the figures [T5b| and [l6c| we have a larger 
h and after Acc — 5, respectively, Acc = 3 the computations are dominated by the far-field, 
i.e., mainly the M2L kernels. They are scheduled by StarPU dynamically to available CPUs and 
GPUs. However, since our GPU implementation of the M2L kernel is not yet optimized, the 
flop-rate breaks down once the far-field becomes dominant . In Fig. |15c| we have the uniform 
case with h = 8, which is clearly to large (about 8 7 leafs which contain only about 18 particles). 



The flop-rate is one order of magnitude smaller then for h = 6, 7. In Fig. 15c the examples for 
Acc = 6, 7 have not been computed. 

What is the main difference between uniform and non-uniform particle distributions? In the 
uniform case particles are distributed in a tree-dimensional subset of K 3 , in the non-uniform 
case they are distributed in a two-dimensional subset of M 3 , only. This leads to an octree in the 
uniform case and to a quadtree in the non-uniform case. Since N is the same and the convex-hull 
is of the same order in both cases, at a given level of the tree a non-empty cluster in the non- 
uniform case tends to contain an order of magnitude more particles compared to the uniform 
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Figure 14: Flop-rates on three different computers for different group sizes n g < 12000 (particles 
per group) for Acc — 5 and h — 7 (uniform particle distribution) 



case. That is why if we compare Fig. |15a| and Fig. |16b| we obtain the same flop-rates even though 
h is different. 

In Fi g. [P7| we compare absolute timing results corresponding to the flop-rates from Fig. [15] 
and Fig. |16| We omit results for the uniform case with h = 8. In Fig. |17a| and |17b| the times 
for the non-uniform case and h = 6 (dominated by near-field) are outside of the presentable 
range. All figures show that in the uniform case a octree height of h — 7 leads to the shortest 
computation times. In the non-uniform case for accuracies Acc < 6 an octree h = 8 otherwise 
h = 7 provides the best setting. 

Figure [18] addresses the scaling behavior of our dynamically scheduled FMM implementation 
in the case of heterogeneous architectures. We take the results from Figure [T7] ch oose the minimal 
timings for each h and plot them over Acc. In the uniform case (Figure 18a I they correspond 
to h = 7. In the non-uniform case (Figure 18b I and with and 1 GPU they correspond to 
h = 7, too. However, with 2, 3 GPUs and up an accuracy Acc — 5, 6 an octree height h = 8 
and for higher accuracies h — 7 leads to minimal timings. As expected, we get excellent scaling 
for low accuracies (Acc < 5) where the near-field is dominating (we have an optimized GPU 
implementation of the P2P kernel). For example, let us look at Acc = 3: without GPU the 
computation takes 120.8s, with 1,2,3 GPUs it takes 44.1s, 22.1s and 14.7s, respectively. The 
scaling for high accuracies can be improved by implementing optimized far-field kernels (P2M, 
M2M, M2L, L2L and L2P). 

A comment on the 1 GPU case in Figure [18b] the timings for Acc = 2, 3 are not as expected. 
They should be the same as for Acc = 4, 5, because the near-field is dominating up to Acc < 5. 
In our understanding the reason for that is the fact that the Eager scheduler (see Section [5| has 
problems to schedule very small tasks correctly. However, by using the Heft scheduler we were 
able to reproduce the expected results. 
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Figure 15: Flop-rates for uniform distribution (in the unit-cube) of 38 • 10 6 particles and group 
size n g = 1000 
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Figure 17: Comparison of timing results (logarithmic scale) for uniform and non-uniform distri- 
bution of TV = 38 • 10 6 particles on 0, 1, 2 or 3 GPUs and n g = 1000 
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7 Conclusion 



We have proposed an original FMM implementation over a runtime system. Thanks to a break- 



down of the computational work (Section 4.3 1, we have shown that FMM presents excellent 



properties for being efficiently pipelined. We have succeeded to design a task flow that can 
exploit these properties and achieve a very high parallel efficiency up to 160 CPU cores on a 



SGI Altix UV 100 machine (Section 4.4). Because almost the entire work load is shared by P2P 



and M2L tasks (Section |4.3[), we have implemented a GPU version of both these kernels. We 



have shown that when the near-field is dominant (Section 5.3 1 or comparable to the far-field 



(Section 5.2 1, we manage to consistently pipeline the task flow and achieve very well balanced 
executions by deporting P2P to GPU on demand. Thanks to a highly optimized P2P kernel, we 
have achieved very high performance. For instance, potentials and forces for 38 million particles 
could be processed in 13.34 s on our 12 cores Nehalem processor enhanced with 3 M2090 Fermi 
GPUs. When the far-field is dominant, some M2L tasks also have to be deported to GPUs in 
order to balance the load. We have shown that an efficient pipeline could be performed even in 
that case (although the impact on performance itself is less impressive since we did not optimize 
the GPU M2L kernel). 

The successive optimizations discussed in this study result in a single code that gets highly 
pipelined and balanced across architectures. Since we rely on the so-called black-box method, 
our code furthermore has a broad range of applications (Section |2.1| . 

We plan to apply this approach to clusters of heterogeneous nodes. One possibility will 
consist of statically distributing the work load to the different nodes of the cluster and explicitly 
handling the inter-node communications. Alternatively, we might also let the runtime handle 
these communications thanks to the task flow. We also plan to design optimized GPU kernels 
for the six FMM tasks. Indeed, if all GPU kernels are provided to the runtime, it can limit 
data movement by processing on individual GPUs connected subparts of the task flow. It will 
be interesting to assess whether such an approach allows for a better strong scaling. 
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